Team members

Leon Harper (21385662)

Thomas Newton (21365654)

Michal Jedruszczak (21440496)

These are the team members for the group project.

Initial set-up

Step 1: Problem statement

World of Warcraft Problem Statement

World of Warcraft is a popular MMORPG (massively multiplayer online role-playing game) with millions of users every month. With that many players, some cyberbullying and toxic behaviour is inevitable; World of Warcraft, however, is especially toxic and is often ranked as having one of the most toxic gaming communities today. In a 2021 survey by the ADL (Anti-Defamation League), 66% of adults aged 18-45 reported having been harassed or bullied in World of Warcraft (Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021).

The objective of this project is to create a model that can detect cyberbullying/toxicity. The World of Warcraft dataset provided to us gives the model reference examples of comments deemed to be bullying.

This would be a classification model that, when given a comment or statement, decides whether it is bullying or not bullying; it does this by detecting certain words and phrases.

Currently, most video games offer an option to filter chat, but this only censors certain words and phrases. Our model will be able to detect strings of words rather than just individual ones.

League of Legends problem statement

League of Legends is a popular MOBA (multiplayer online battle arena) with millions of users every month. With that many players, some cyberbullying and toxic behaviour is inevitable; League, however, is especially toxic and is often ranked as having one of the most toxic gaming communities today. The current ban rate across all accounts as of September 2022 is 2.25% (around 2,632,500 accounts). In a 2021 survey by the ADL (Anti-Defamation League), 65% of adults aged 18-45 reported having been harassed or bullied in League of Legends.

The objective of this project is to create a model that can detect cyberbullying/toxicity. The League of Legends dataset provided to us gives the model reference examples of comments deemed to be bullying.

This would be a classification model that, when given a comment or statement, decides whether it is bullying or not bullying; it does this by detecting certain words and phrases.

Currently, most video games offer an option to filter chat, but this only censors certain words and phrases. Our model will be able to detect strings of words rather than just individual ones.

ADL Survey: Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021 (adl.org)

Step 2: Importing data

wow_posts_df <- read.csv("Data/posts_wow.csv")
wow_annotations_df <- read.csv("Data/annotations_wow.csv")

lol_posts_df <- read.csv("Data/posts_lol.csv")
lol_annotations_df <- read.csv("Data/annotations_lol.csv")

This imports the required data for the project. The data was exported from an SQL script that creates the necessary tables (i.e. posts and annotations) and populates them with data. To simplify importing, we exported the SQL table data into CSV files using custom SQL queries, as the MySQL Workbench Table Export Wizard doesn't export all of the data properly.

Step 2.5: Preliminary EDA

We perform a preliminary EDA to understand what kind of data we are dealing with and how it should be cleaned.

League of Legends analysis

summary(lol_posts_df)
##     topic_id       post_number      author          html_message      
##  Min.   :   2.0   Min.   :   0   Length:16867       Length:16867      
##  1st Qu.:  33.0   1st Qu.: 364   Class :character   Class :character  
##  Median : 140.0   Median : 953   Mode  :character   Mode  :character  
##  Mean   : 604.6   Mean   :1393                                        
##  3rd Qu.:1090.0   3rd Qu.:1998                                        
##  Max.   :2481.0   Max.   :5358                                        
##   timestamp        
##  Length:16867      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
nrow(lol_posts_df)
## [1] 16867

There are 16867 data points in the League of Legends posts data.

lol_posts_df %>% select("topic_id") -> lol_topics #gets all unique lol_topics from the dataset
nrow(unique(lol_topics))
## [1] 17

This data set has a total of 17 different topics.

World of Warcraft analysis

In our World of Warcraft posts dataset, we are given 16978 rows of 5 features: topic_id, post_number, author, html_message and timestamp. For the purposes of this project, we will ignore timestamp, as it will not be used to train our models. The additional WoW annotations dataset reveals which messages were flagged as cyberbullying, from which the only useful features are the post_number and topic_id. This information has been combined into a single file – clean_posts_balanced_sample.csv.

Posts in the main dataset are formatted using HTML, meaning that our model will either have to be trained to recognise patterns such as paragraph breaks or we will have to clean the data and transform it into something more appropriate.

Furthermore, as we are only using a single set of features (post_number and topic_id) from the annotations dataset, we may be able to add another column to the posts dataset – a Boolean value representing whether or not a specific post contains cyberbullying. This will eliminate the need for the use of two separate datasets to develop our models.

Regarding the types of data we’re given, topic IDs, authors and HTML messages are categorical, while post numbers are ordinal. In the annotations data set, all values but post number are categorical.

summary(wow_posts_df)
##    topic_id          post_number      author          html_message      
##  Length:16978       Min.   :   0   Length:16978       Length:16978      
##  Class :character   1st Qu.: 226   Class :character   Class :character  
##  Mode  :character   Median : 708   Mode  :character   Mode  :character  
##                     Mean   :1366                                        
##                     3rd Qu.:2420                                        
##                     Max.   :4692                                        
##                     NA's   :3                                           
##   timestamp        
##  Length:16978      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
nrow(wow_posts_df)
## [1] 16978

There are 16978 data points.

wow_posts_df %>% select("topic_id") -> wow_topics #gets all unique wow_topics from the dataset
nrow(unique(wow_topics))
## [1] 23

This data set has a total of 23 different topics.

Step 3: Cleaning the data / Pre-processing

For pre-processing, we decided to merge the datasets, as they are essentially identical in structure, and we believed this would simplify pre-processing and remove code duplication. After merging, we found that we could create an “is_bullying” column to serve as the target variable for the model building process. We also created a “bullying_severity” column, because each cyberbullying post can be flagged by multiple annotators.

After that, we extracted the message text out of the HTML so that we could pre-process the messages through stemming, removing punctuation and removing stopwords (the messages contain HTML because both datasets come from web forums). We then derived a “word_counts” column for EDA purposes and sampled the data to balance it. We then created training and test datasets for model building, where each dataset is a document-term matrix, as we cannot pass raw text data to the models. Finally, we export the relevant data (including the clean data).

# Creates dataset column to merge posts and annotations csv files together
wow_posts_df$dataset <- "WoW"
lol_posts_df$dataset <- "LoL"
wow_annotations_df$dataset <- "WoW"
lol_annotations_df$dataset <- "LoL"

posts_df <- rbind(wow_posts_df, lol_posts_df)
annotations_df <- rbind(lol_annotations_df, wow_annotations_df)

Since wow_posts_df and lol_posts_df have the same structure, we merged the posts and annotation data frames to simplify pre-processing (this avoids repeating code). However, we still need to analyse the datasets separately for EDA purposes, so we created a “dataset” feature to keep track of each post's origin.

posts_df$id <- paste(posts_df$dataset, posts_df$topic_id, posts_df$post_number, sep="_")
annotations_df$id <- paste(annotations_df$dataset, annotations_df$topic_id, annotations_df$post_number, sep="_")

To simplify the merging of data frames, we create an ID column so that a left join can be performed on a single key. This mitigates the issue of duplicate topic IDs and post numbers, as post numbers are only unique within a topic.
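The effect of the composite key can be sketched with base R alone (the toy data below is hypothetical; merge() with all.x = TRUE stands in for dplyr's left_join):

```r
# Toy example (hypothetical data): post numbers repeat across topics and
# datasets, so no single column is unique on its own
posts <- data.frame(dataset = c("WoW", "LoL"),
                    topic_id = c(1, 1),
                    post_number = c(5, 5))
annotations <- data.frame(dataset = "LoL", topic_id = 1,
                          post_number = 5, annotator = 2)

# Composite key: unique across both forums
posts$id <- paste(posts$dataset, posts$topic_id, posts$post_number, sep = "_")
annotations$id <- paste(annotations$dataset, annotations$topic_id,
                        annotations$post_number, sep = "_")

# Base-R left join on the single key (merge with all.x = TRUE)
joined <- merge(posts, annotations[, c("id", "annotator")], by = "id", all.x = TRUE)

# Posts with no matching annotation row are the non-bullying posts
joined$is_bullying <- as.integer(!is.na(joined$annotator))
```

Here only the LoL post has an annotation, so it is the only row with is_bullying equal to 1.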

merged_df <- left_join(posts_df, annotations_df, by = "id", keep=TRUE)
merged_df$is_bullying <- as.integer(!is.na(merged_df["id.y"]))
drop <- c('topic_id.y', 'post_number.y', 'dataset.y', 'id.y', 'offender', 'victim')
merged_df <- merged_df[, !(names(merged_df) %in% drop)]

# Removes the ".x" characters from the remaining annotations columns
colnames(merged_df) = sub(".x", "", colnames(merged_df))

# Create bullying_severity column
names(merged_df)[names(merged_df) == "annotator"] <- "bullying_severity"
merged_df["bullying_severity"][is.na(merged_df["bullying_severity"])] <- 0
posts_df <- merged_df %>% group_by(id) %>% slice(which.max(bullying_severity))

This code performs a left join to merge the dataframes together. Most of the columns from the annotations dataframe are useless for training an NLP classifier, so we remove them. Since the join produces duplicate columns on each side, we drop the “.y” columns.

We also created a bullying_severity column as we found that some posts have been annotated as bullying by multiple annotators which could make this a useful feature for model building.
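The per-post maximum behind slice(which.max(...)) can be sketched with base R alone (the annotation rows below are hypothetical):

```r
# Toy annotations (hypothetical): one post flagged by two annotators at
# different severities, another post never flagged (severity 0)
ann <- data.frame(id = c("WoW_1_5", "WoW_1_5", "WoW_1_6"),
                  bullying_severity = c(1, 2, 0))

# Keep the highest severity recorded per post -- a base-R equivalent of
# group_by(id) %>% slice(which.max(bullying_severity))
severity <- aggregate(bullying_severity ~ id, data = ann, FUN = max)
```

After aggregation each post appears exactly once, carrying the highest severity any annotator assigned to it.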

remove_html <- function(html_msg, isHtml) {
  if(isHtml) {
    # Remove backslashes when dealing with LoL forum data
    html_msg <- gsub("\\\\", '', html_msg)
    # Get XML nodes
    msg <- xml2::read_html(html_msg)
    # Get the block quotes and quotes (blockquotes for WoW, .quote for LoL)
    blockquotes <- msg %>% html_nodes("blockquote")
    quotes <- msg %>% html_nodes(".quote")
    
    # Remove quote elements for LoL and WoW datasets
    xml_remove(blockquotes)
    xml_remove(quotes)
    msg <- html_text(msg)
    return(msg)
  }
  return(html_msg)
}

The “html_message” column contains a mix of messages that include HTML and messages that contain no HTML at all. To handle this, we create an “is_html” column that uses a regular expression to detect HTML, which prevents errors with rvest. The “tm” package does not handle removing HTML content, and we cannot simply use a regular expression to strip HTML tags because the data originates from gaming forums where blockquote elements are frequently used; a regular expression would remove the tags but leave the quoted content inside the blockquotes behind.

To remove the quoted content, we use rvest to acquire the blockquote elements (used by the WoW forum) as well as any elements with the “.quote” class (used by the LoL forum), and then we use xml_remove() to remove those nodes. We then convert the rvest object back into a string.
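The detection step can be sketched with base R alone (grepl standing in for str_detect; the messages below are made up):

```r
# Any text between angle brackets counts as HTML for routing purposes
detect_html_regex <- "<.*?>"

# Hypothetical messages: two HTML-formatted posts and one plain-text post
messages <- c("<p>gg wp</p>",
              "<blockquote>quoted text we want to drop</blockquote>",
              "just a plain chat line")

# Only the first two messages would be routed through the rvest branch
is_html <- grepl(detect_html_regex, messages)
```

The plain-text message is skipped entirely, which is what prevents rvest from erroring on input that is not valid HTML.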

# Regex for detecting HTML
detect_html_regex <- "<.*?>"
# Create is_html column
posts_df$is_html <- str_detect(posts_df$html_message, detect_html_regex)
# Apply remove_html function to html_message
posts_df$html_message <- mapply(remove_html, posts_df$html_message, posts_df$is_html)
posts_df <- posts_df[, !(names(posts_df) %in% 'is_html')]

# Converts any text matching the regex passed into the transformer into a
# space character (extra spaces are collapsed later by stripWhitespace)
toSpaceTransformer <- content_transformer(function (x, pattern) gsub(pattern, " ", x))
posts_corpus <- Corpus(VectorSource(posts_df$html_message))
posts_corpus <- posts_corpus %>% 
                tm_map(content_transformer(tolower)) %>%
                tm_map(toSpaceTransformer, "http\\S+\\s*") %>%
                tm_map(removeNumbers) %>%
                tm_map(removeWords, stopwords("english")) %>%
                tm_map(removePunctuation) %>%
                tm_map(stemDocument) %>%
                tm_map(stripWhitespace)
posts_df$html_message <- data.frame(text=sapply(posts_corpus, identity), stringsAsFactors = F)$text

This code removes useless characters, stopwords and punctuation, and it applies stemming to improve model performance. Certain pre-processing steps could be tweaked to improve model performance (e.g. the set of stopwords being omitted), as the pre-processing could end up being too rigorous.

We removed the HTML characters first in order to prevent interference when removing punctuation or whitespace.

posts_df$word_counts <- str_count(posts_df$html_message, "\\S+")

This code gets the word counts for the html messages which can be used for analysing word counts in the EDA. We may also use the word counts to filter messages with word counts that are too low.
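The same word count can be sketched in base R (counting maximal runs of non-whitespace, mirroring str_count(x, "\\S+")):

```r
# Base-R equivalent of str_count(x, "\\S+"): count runs of non-whitespace
count_words <- function(x) {
  lengths(regmatches(x, gregexpr("\\S+", x)))
}

# A fully pre-processed-away message ("") yields a count of 0,
# which is what the NA-removal step below relies on
count_words(c("play rank game", "", "gg"))
```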

posts_df <- posts_df %>% na_if("") %>% na.omit

This code removes rows from posts_df whose message became empty (NA) after pre-processing, which can happen when the original message consisted almost entirely of stop words.
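A base-R sketch of the same idea (the toy frame is hypothetical; the real code uses na_if("") %>% na.omit across all columns):

```r
# Hypothetical frame: the second message was reduced to "" by pre-processing
df <- data.frame(html_message = c("play rank game", "", "gg"),
                 is_bullying  = c(0, 1, 0))

# Treat empty messages as missing, then drop those rows
df$html_message[df$html_message == ""] <- NA
df <- na.omit(df)
```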

write.csv(posts_df, file="Data/clean_posts.csv")

This code exports the clean posts to a csv file to be analysed separately. This also comes in handy in order to save time when performing EDAs as pre-processing can take time (especially on slow computers).

corpus = VCorpus(VectorSource(posts_df$html_message))
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
posts_data = as.data.frame(as.matrix(dtm))
posts_data$is_bullying = as.factor(posts_df$is_bullying)

We create a document-term matrix from the messages and remove sparse terms (i.e. terms that appear in only a tiny fraction of documents) using removeSparseTerms. We then attach the “is_bullying” column for model building and EDA purposes.
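What removeSparseTerms does can be illustrated conceptually in base R (toy tokenised corpus; real DTMs use tm's own representation, and we use a much stricter threshold of 0.999):

```r
# Conceptual sketch of sparse-term removal on a tiny tokenised corpus
docs <- list(c("play", "game"), c("play", "rank"), c("play"), c("noob"))

# Document frequency: how many documents contain each term
terms <- unique(unlist(docs))
doc_freq <- sapply(terms, function(t) sum(sapply(docs, function(d) t %in% d)))

# removeSparseTerms(dtm, s) drops terms whose sparsity (share of documents
# NOT containing the term) exceeds s; here s = 0.5 for illustration
sparsity <- 1 - doc_freq / length(docs)
kept <- terms[sparsity <= 0.5]
```

With a threshold of 0.5 only "play" (present in 3 of 4 documents) survives; the one-off terms are dropped, which keeps the matrix to a manageable width.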

The data is heavily imbalanced: there are relatively few bullying cases. This would result in a classifier that is more accurate at classifying non-bullying cases than bullying cases. We will use undersampling because we have plenty of non-bullying data but not enough bullying data (meaning we can afford to reduce the amount of data we are dealing with).

is_bullying = which(posts_data$is_bullying == 1)
not_bullying = which(posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
posts_data_balanced = posts_data[c(sample_bullying, sample_not_bullying),]

This creates a sample of the data for balancing purposes. However, this comes at the expense of having much less data to work with, as there are significantly fewer cyberbullying cases than non-cyberbullying cases.

set.seed(42)
part <- sample(2, nrow(posts_data), replace=TRUE, prob=c(0.6, 0.4))
train <- posts_data[part == 1, ]
test <- posts_data[part == 2, ]

We split the data using a 60:40 train-test split.

set.seed(42)
part <- sample(2, nrow(posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
train_balanced <- posts_data_balanced[part == 1, ]
test_balanced <- posts_data_balanced[part == 2, ]

We split the data using a 60:40 split. This is for the balanced data.

write.csv(posts_data_balanced, file="Data/clean_posts_dtm_balanced_sample.csv")
write.csv(train_balanced, file="Data/train_balanced.csv")
write.csv(test_balanced, file="Data/test_balanced.csv")
write.csv(train, file="Data/train.csv")
write.csv(test, file="Data/test.csv")
write.csv(posts_data, file="Data/clean_posts_dtm.csv")

We export the training and test data to make steps such as model building and EDA easier and more convenient as it can take time to pre-process the data (especially on slower computers).

Step 4: EDA

League of Legends analysis

posts_df %>% filter(dataset == "LoL") -> lol_posts
lol_posts %>% filter(is_bullying == 1) -> bullying_lol
bullying_lol
nrow(bullying_lol) / nrow(lol_posts) * 100
## [1] 1.530429

For this data set 1.53% of the posts are labeled as bullying.

lol_tibble <- tibble(txt = lol_posts$html_message)
lol_tibble #transforming the HTML messages into a tibble for an easier workflow
lol_tibble <- lol_tibble%>% 
    mutate(linenumber = row_number()) %>%
 unnest_tokens(word, txt) %>% anti_join(stop_words)
lol_tibble #splitting tibble by words
lol_counts <- lol_tibble %>% count(word, sort=TRUE)
lol_counts #sorting words by count
wordcloud(lol_counts$word, lol_counts$n, max.words = 250,
          min.freq=25, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

  #creating a word cloud out of sorted list

By using a word cloud to analyse the HTML messages from the League of Legends boards, we have gained several insights into the issue of cyberbullying within the game.

The word cloud helped us identify the most common words and phrases used in the chat. This allowed us to see whether certain language or terminology was frequently used in a negative or bullying context. For example, words such as “stupid” and “idiot” were often marked as cyberbullying, which could indicate that players were using those words to mock or belittle others.

lol_bigrams <- tibble(txt = lol_posts$html_message) %>% 
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
lol_bigrams
#split original tibble into two-word bigrams
lol_bigrams <- lol_bigrams %>%
      separate(bigram, c("word1", "word2"), sep = " ") 
lol_bigrams #separating them out for easier cleaning
lol_bigrams <- lol_bigrams %>%
      filter(!word1 %in% stop_words$word) %>%
      filter(!word2 %in% stop_words$word)
lol_bigrams #filtering out unwanted words
lol_bigrams <- lol_bigrams %>%
            filter(!is.na(word1)) %>% 
            filter(!is.na(word2))
lol_bigrams #removing null values
lol_bigrams <- lol_bigrams %>%
            unite(bigram, word1, word2, sep=" ")
lol_bigrams #joining the words back together
lol_bigram_counts <- lol_bigrams %>% count(bigram, sort=TRUE)
lol_bigram_counts #counting and sorting bigrams
lol_bigram_counts %>% 
  filter(str_detect(lol_bigram_counts$bigram,"[0-9]", negate = TRUE)) -> lol_bigram_counts
lol_bigram_counts #removing any bigrams with numbers
lol_filtered_bigrams <- lol_bigram_counts %>%
                  filter(n >= 4)
lol_filtered_bigrams #sorting remaining bigrams with a frequency of 4 or more
lol_separated_bigrams <- lol_filtered_bigrams %>% 
  select("bigram") %>%
  separate(bigram, c("word1", "word2"), sep = " ") 
lol_separated_bigrams #separating bigrams again, preparing for graphical representation
lol_bigram_graph <- lol_separated_bigrams %>%
                  graph_from_data_frame()
lol_bigram_graph  #creating bigram graph
## IGRAPH d3b2548 DN-- 1761 6031 -- 
## + attr: name (v/c)
## + edges from d3b2548 (vertex names):
##  [1] lp      ->ffsgive  late    ->game     twin    ->fang     play    ->game    
##  [5] earli   ->game     game    ->mode     play    ->dominion rank    ->queue   
##  [9] play    ->rank     leagu   ->legend   peopl   ->play     poison  ->mage    
## [13] rank    ->dominion summon  ->rift     rank    ->game     dominion->player  
## [17] player  ->base     game    ->play     share   ->account  account ->share   
## [21] twist   ->treelin  lane    ->bulli    win     ->rate     loss    ->prevent 
## [25] mana    ->cost     ap      ->ratio    enemi   ->team     lane    ->phase   
## [29] everi   ->singl    fortun  ->teller   noxious ->blast    everi   ->time    
## + ... omitted several edges
ggraph(lol_bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

For this project we used a bigram representation to quickly visualize the most common two-word phrases in this large text dataset. This helped us to get a sense of the collocation of words in the data, which gave important insights into the meaning and context of the text.

Additionally, the bigram representation allowed us to easily identify any unusual or unexpected combinations of words in the dataset. This was particularly useful when working with unstructured data, as it let us quickly see some of the relationships between words in the text. By identifying the most common two-word phrases, we could see how words collocate in the dataset, giving us an idea of how they were being used and the relationships between them. This helped us identify patterns and themes that would have been difficult to discern by looking at individual words alone.

In summary, using a bigram representation was a valuable tool for our exploratory data analysis, as it helped us to quickly gain a better understanding of our dataset by identifying patterns and themes, and also helped us to identify any unusual or unexpected combinations of words that might be present in the HTML message data.

World of Warcraft analysis

wow_posts <- posts_df %>% filter(dataset == "WoW")
# Filter within the WoW subset so the percentage below is computed
# over WoW posts only (mirrors the LoL analysis above)
wow_posts %>% filter(is_bullying == 1) -> bullying_wow
bullying_wow
nrow(bullying_wow) / nrow(wow_posts) * 100
## [1] 2.437285

For this data set 2.4% of the posts are labeled as bullying.

wow_tibble <- tibble(txt = wow_posts$html_message)
wow_tibble #transforming the HTML messages into a tibble for an easier workflow
wow_tibble <- wow_tibble%>% 
    mutate(linenumber = row_number()) %>%
 unnest_tokens(word, txt) %>% anti_join(stop_words)
## Joining, by = "word"
wow_tibble #splitting tibble by words and removing stop words
wow_counts <- wow_tibble %>% count(word, sort=TRUE)
wow_counts #sorting and counting the remaining words
wordcloud(wow_counts$word, wow_counts$n, max.words = 250,
          min.freq=25, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

  #creating word cloud

We used a word cloud to quickly visualise the most common words in our large text dataset. This helped us to get a sense of the overall topic and context of the data, as well as identify any patterns or themes that might be present.

Additionally, using a word cloud allowed us to easily identify any outliers or unusual words that might be present in the dataset, which could then be further investigated. This was particularly useful when working with unstructured data, as it allowed us to quickly gain insights without having to manually sift through all of the text. Using a word cloud was a valuable tool in our exploratory data analysis and helped us to quickly gain a better understanding of the dataset.

wow_bigrams <- tibble(txt = wow_posts$html_message) %>% 
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
wow_bigrams #split original tibble into two-word bigrams
wow_bigrams <- wow_bigrams %>%
      separate(bigram, c("word1", "word2"), sep = " ") 
wow_bigrams #separating them out for easier cleaning
wow_bigrams <- wow_bigrams %>%
      filter(!word1 %in% stop_words$word) %>%
      filter(!word2 %in% stop_words$word)
wow_bigrams #filtering out unwanted words
wow_bigrams <- wow_bigrams %>%
            filter(!is.na(word1)) %>% 
            filter(!is.na(word2))
wow_bigrams #removing null values
wow_bigrams <- wow_bigrams %>%
            unite(bigram, word1, word2, sep=" ")
wow_bigrams #joining the words back together
wow_bigram_counts <- wow_bigrams %>% count(bigram, sort=TRUE)
wow_bigram_counts #counting and sorting bigrams
wow_bigram_counts %>% 
  filter(str_detect(wow_bigram_counts$bigram,"[0-9]", negate = TRUE)) -> wow_bigram_counts
wow_bigram_counts #removing any bigrams with numbers
wow_filtered_bigrams <- wow_bigram_counts %>%
                  filter(n >= 4)
wow_filtered_bigrams #sorting remaining bigrams with a frequency of 4 or more
wow_separated_bigrams <- wow_filtered_bigrams %>% 
  select("bigram") %>%
  separate(bigram, c("word1", "word2"), sep = " ") 
wow_separated_bigrams #separating bigrams again, preparing for graphical representation
wow_bigram_graph <- wow_separated_bigrams %>%
                  graph_from_data_frame()
wow_bigram_graph #creating bigram graph
## IGRAPH f3ff41d DN-- 1898 8398 -- 
## + attr: name (v/c)
## + edges from f3ff41d (vertex names):
##  [1] connect->realm       realm  ->connect     pvp    ->realm      
##  [4] play   ->game        pve    ->realm       mani   ->peopl      
##  [7] classic->server      vanilla->server      rp     ->realm      
## [10] lot    ->peopl       classic->realm       fli    ->mount      
## [13] popul  ->realm       peopl  ->play        argent ->dawn       
## [16] hellfir->hellfir     low    ->popul       confirm->kill       
## [19] play   ->wow         vanilla->realm       max    ->level      
## [22] real   ->life        world  ->warcraft    defia  ->brotherhood
## + ... omitted several edges
ggraph(wow_bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.

The bigram representation helped us identify the most common two-word phrases used in the chat. This allowed us to see the distribution of words and understand how players were communicating with each other. For example, the phrase “commit suicide” can be seen in the bigram graph; messages containing this phrase were often flagged as cyberbullying.

Step 5: Predictive modelling

Initial setup

train_balanced$is_bullying = as.factor(train_balanced$is_bullying)
test_balanced$is_bullying = as.factor(test_balanced$is_bullying)

train_control = trainControl(method = "cv", number = 5)

SVM

set.seed(42)

tic()

svm_model = caret::train(is_bullying~., data=train_balanced , method =  "svmLinear" , trControl = train_control, verbose=FALSE)

svm_toc <- toc(quiet=T)

svm_time_taken <- svm_toc$toc - svm_toc$tic
svm_pred_y = predict(svm_model, test_balanced)

MLP

set.seed(42)
tic()

mlp_model = caret::train(is_bullying~., data=train_balanced , method =  "mlp" , trControl = train_control, verbose=FALSE)
mlp_toc <- toc(quiet=T)

mlp_time_taken <- mlp_toc$toc - mlp_toc$tic
mlp_pred_y = predict(mlp_model, test_balanced)

glmnet

set.seed(42)

tic()

glmnet_model = caret::train(is_bullying~., data=train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_toc <- toc(quiet=T)

glmnet_time_taken <- glmnet_toc$toc - glmnet_toc$tic
glmnet_pred_y = predict(glmnet_model, test_balanced)

Step 6: Hyperparameter Tuning

random_train_control_grid = caret::trainControl(method="cv", number=3, search="random")

We decided to go with a random grid search as it simplified hyperparameter tuning and removed potential biases arising from our uneven knowledge of the different models being trained (i.e. we thought a random search would make comparisons between tuned models more consistent).

SVM Linear kernel

set.seed(42)

tic()

svm_linear_tuned = caret::train(is_bullying~ ., data = train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)

svm_linear_tuned_toc = toc(quiet=T)
svm_linear_tuned_time_taken <- svm_linear_tuned_toc$toc - svm_linear_tuned_toc$tic
svm_tuned_pred_y = predict(svm_linear_tuned, test_balanced)

SVM Polynomial kernel

set.seed(42)

tic()

svm_model_poly = caret::train(is_bullying~ ., data = train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)

svm_poly_toc = toc(quiet=T)

svm_poly_tuned_time_taken = svm_poly_toc$toc - svm_poly_toc$tic
svm_poly_pred_y = predict(svm_model_poly, test_balanced)

SVM RBF Kernel

set.seed(42)

tic()

svm_model_rbf = caret::train(is_bullying~ ., data = train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)

svm_rbf_toc = toc(quiet=T)

# summarizing the results
svm_rbf_tuned_time_taken = svm_rbf_toc$toc - svm_rbf_toc$tic
svm_rbf_pred_y = predict(svm_model_rbf, test_balanced)

MLP Tuned

set.seed(42)

tic()

mlp_model_tuned = caret::train(is_bullying~ ., data = train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)

mlp_toc = toc(quiet=T)

mlp_tuned_time_taken = mlp_toc$toc - mlp_toc$tic
mlp_tuned_pred_y = predict(mlp_model_tuned, test_balanced)

GLMNET Tuned

set.seed(42)

tic()

glm_tuned_model = caret::train(is_bullying~., data=train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_tuned_toc <- toc(quiet=T)

glm_tuned_time_taken <- glm_tuned_toc$toc - glm_tuned_toc$tic
glm_tuned_pred_y = predict(glm_tuned_model, test_balanced)

Step 7: Evaluation

For this evaluation, we wanted a comprehensive assessment of the models using key metrics such as recall, precision, F1 score and the time taken to train each model, to give a better idea of which models perform best. We consider the F1 score the most important metric: a cyberbullying classifier should minimise both false positives (i.e. people who haven't actually bullied anybody) and false negatives (i.e. bullies that haven't been identified), and the F1 score acts as a trade-off between precision and recall, which is exactly what we need. The F1 score is also robust to uneven class distributions, unlike accuracy, which gives us flexibility in how we tweak the pre-processing.
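As a quick sanity check, precision, recall and F1 can be computed directly from raw confusion-matrix counts. The sketch below plugs in the counts from the SVM confusion matrix shown later in this step, treating class 0 as positive (as caret does):

```r
# Precision, recall and F1 from raw confusion-matrix counts
f1_metrics <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Counts read off the SVM confusion matrix ('positive' class 0):
# 120 true positives, 41 false positives, 32 false negatives
f1_metrics(tp = 120, fp = 41, fn = 32)
```

The precision and recall produced here correspond to the “Pos Pred Value” (0.7453) and “Sensitivity” (0.7895) rows of caret's confusionMatrix output for the SVM model.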

We have utilised the following to create an effective evaluation:

1. Confusion matrices that illustrate where misclassifications are taking place
2. ROC curves to display model performance
3. A table of metrics showing the recall, precision, F1 score and the time taken to train each model

We wanted to show how long it took to train each model as this helps identify the production costs of each model (the models would take much longer to train on big batches of data).

Creating confusion matrices

svm_confusion_matrix <- caret::confusionMatrix(svm_pred_y, test_balanced$is_bullying)
mlp_confusion_matrix <- caret::confusionMatrix(mlp_pred_y, test_balanced$is_bullying)
glm_confusion_matrix <- caret::confusionMatrix(glmnet_pred_y, test_balanced$is_bullying)
svm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 120  41
##          1  32 117
##                                           
##                Accuracy : 0.7645          
##                  95% CI : (0.7133, 0.8106)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5294          
##                                           
##  Mcnemar's Test P-Value : 0.3491          
##                                           
##             Sensitivity : 0.7895          
##             Specificity : 0.7405          
##          Pos Pred Value : 0.7453          
##          Neg Pred Value : 0.7852          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3871          
##    Detection Prevalence : 0.5194          
##       Balanced Accuracy : 0.7650          
##                                           
##        'Positive' Class : 0               
## 
mlp_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 105  34
##          1  47 124
##                                           
##                Accuracy : 0.7387          
##                  95% CI : (0.6861, 0.7867)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4764          
##                                           
##  Mcnemar's Test P-Value : 0.1824          
##                                           
##             Sensitivity : 0.6908          
##             Specificity : 0.7848          
##          Pos Pred Value : 0.7554          
##          Neg Pred Value : 0.7251          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3387          
##    Detection Prevalence : 0.4484          
##       Balanced Accuracy : 0.7378          
##                                           
##        'Positive' Class : 0               
## 
glm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 143  48
##          1   9 110
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7684, 0.8577)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6339          
##                                           
##  Mcnemar's Test P-Value : 4.823e-07       
##                                           
##             Sensitivity : 0.9408          
##             Specificity : 0.6962          
##          Pos Pred Value : 0.7487          
##          Neg Pred Value : 0.9244          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4613          
##    Detection Prevalence : 0.6161          
##       Balanced Accuracy : 0.8185          
##                                           
##        'Positive' Class : 0               
## 

Examining these confusion matrices, the number of false positives (i.e. posts falsely classified as cyberbullying) appears to be the differentiating factor for F1 score: the glmnet model has the fewest false positives and the highest F1 score (as shown in the table of metrics below).
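The false-positive counts can also be read straight off the caret objects rather than by eye. A sketch, assuming the `confusionMatrix` objects created above are still in scope (the cell indexed is prediction = 1, reference = 0):

```r
# Sketch: false positives = predicted bullying (1) but actually not bullying (0)
fp_counts <- sapply(
  list(svm = svm_confusion_matrix, mlp = mlp_confusion_matrix,
       glmnet = glm_confusion_matrix),
  function(cm) cm$table["1", "0"]
)
sort(fp_counts)  # glmnet should come out lowest
```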

svm_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_pred_y, test_balanced$is_bullying)
svm_poly_cf_matrix <- caret::confusionMatrix(svm_poly_pred_y, test_balanced$is_bullying)
svm_rbf_cf_matrix <- caret::confusionMatrix(svm_rbf_pred_y, test_balanced$is_bullying)

mlp_tuned_cf_matrix <- caret::confusionMatrix(mlp_tuned_pred_y, test_balanced$is_bullying)
glm_tuned_cf_matrix <- caret::confusionMatrix(glm_tuned_pred_y, test_balanced$is_bullying)

Here we create confusion matrices for the tuned models to obtain values such as F1 score, precision and recall.

svm_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124  43
##          1  28 115
##                                           
##                Accuracy : 0.771           
##                  95% CI : (0.7201, 0.8166)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5426          
##                                           
##  Mcnemar's Test P-Value : 0.09661         
##                                           
##             Sensitivity : 0.8158          
##             Specificity : 0.7278          
##          Pos Pred Value : 0.7425          
##          Neg Pred Value : 0.8042          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.5387          
##       Balanced Accuracy : 0.7718          
##                                           
##        'Positive' Class : 0               
## 
svm_poly_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 131  44
##          1  21 114
##                                           
##                Accuracy : 0.7903          
##                  95% CI : (0.7407, 0.8343)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5817          
##                                           
##  Mcnemar's Test P-Value : 0.006357        
##                                           
##             Sensitivity : 0.8618          
##             Specificity : 0.7215          
##          Pos Pred Value : 0.7486          
##          Neg Pred Value : 0.8444          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4226          
##    Detection Prevalence : 0.5645          
##       Balanced Accuracy : 0.7917          
##                                           
##        'Positive' Class : 0               
## 
svm_rbf_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 118  46
##          1  34 112
##                                           
##                Accuracy : 0.7419          
##                  95% CI : (0.6894, 0.7897)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4845          
##                                           
##  Mcnemar's Test P-Value : 0.2188          
##                                           
##             Sensitivity : 0.7763          
##             Specificity : 0.7089          
##          Pos Pred Value : 0.7195          
##          Neg Pred Value : 0.7671          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3806          
##    Detection Prevalence : 0.5290          
##       Balanced Accuracy : 0.7426          
##                                           
##        'Positive' Class : 0               
## 
mlp_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 121  45
##          1  31 113
##                                          
##                Accuracy : 0.7548         
##                  95% CI : (0.703, 0.8017)
##     No Information Rate : 0.5097         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.5103         
##                                          
##  Mcnemar's Test P-Value : 0.1359         
##                                          
##             Sensitivity : 0.7961         
##             Specificity : 0.7152         
##          Pos Pred Value : 0.7289         
##          Neg Pred Value : 0.7847         
##              Prevalence : 0.4903         
##          Detection Rate : 0.3903         
##    Detection Prevalence : 0.5355         
##       Balanced Accuracy : 0.7556         
##                                          
##        'Positive' Class : 0              
## 

Tuning the MLP parameters through random grid search has only marginally improved the accuracy of the model (from roughly 74% to 75%), which may not justify the additional training time.
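The two accuracies can be compared directly from the caret objects rather than read off the printed output. A sketch, assuming both `confusionMatrix` objects from above are in scope:

```r
# Sketch: overall accuracy of the untuned vs tuned MLP fits
c(untuned = unname(mlp_confusion_matrix$overall["Accuracy"]),
  tuned   = unname(mlp_tuned_cf_matrix$overall["Accuracy"]))
```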

glm_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 143  48
##          1   9 110
##                                           
##                Accuracy : 0.8161          
##                  95% CI : (0.7684, 0.8577)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6339          
##                                           
##  Mcnemar's Test P-Value : 4.823e-07       
##                                           
##             Sensitivity : 0.9408          
##             Specificity : 0.6962          
##          Pos Pred Value : 0.7487          
##          Neg Pred Value : 0.9244          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4613          
##    Detection Prevalence : 0.6161          
##       Balanced Accuracy : 0.8185          
##                                           
##        'Positive' Class : 0               
## 

ROC Curves

plot_roc_curves <- function(model_1_pred_y, model_2_pred_y, model_3_pred_y, model_names) {
  # Convert the factor predictions back to their numeric 0/1 labels
  model_1_pred_y <- as.numeric(levels(model_1_pred_y))[model_1_pred_y]
  model_2_pred_y <- as.numeric(levels(model_2_pred_y))[model_2_pred_y]
  model_3_pred_y <- as.numeric(levels(model_3_pred_y))[model_3_pred_y]

  # Square plotting region so the curves are not distorted
  par(pty="s")

  # Plot the first curve, then overlay the other two with add=TRUE,
  # printing each AUC at a different height so the labels do not overlap
  model_1_roc <- roc(test_balanced$is_bullying, model_1_pred_y, plot=TRUE, print.auc=TRUE, col="red", lwd=4, legacy.axes=TRUE, main="ROC Curves")
  model_2_roc <- roc(test_balanced$is_bullying, model_2_pred_y, plot=TRUE, print.auc=TRUE, print.auc.y=0.4, col="blue", lwd=4, add=TRUE)
  model_3_roc <- roc(test_balanced$is_bullying, model_3_pred_y, plot=TRUE, print.auc=TRUE, print.auc.y=0.6, col="green", lwd=4, add=TRUE)

  legend("bottomright", legend=model_names, col=c("red", "blue", "green"), lwd=4)
}

Non-tuned models

plot_roc_curves(svm_pred_y, mlp_pred_y, glmnet_pred_y, c("SVM", "MLP", "GLMNET"))

Comparing tuned SVM models

plot_roc_curves(svm_tuned_pred_y, svm_poly_pred_y, svm_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"))

It would appear that the polynomial kernel is the most suitable for bullying classification, so we will compare this SVM model against the other tuned models.

Tuned models

plot_roc_curves(svm_poly_pred_y, mlp_tuned_pred_y, glm_tuned_pred_y, c("SVM Poly", "MLP Tuned", "GLMNET"))

This function plots the ROC curve for each model in a different colour and prints the AUC value alongside each curve.
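If the exact AUC values are needed beyond the plot annotations, pROC can return them directly. A sketch, assuming pROC is loaded and the prediction factors from above are in scope:

```r
# Sketch: numeric AUC for one model, converting the factor back to 0/1 first
auc(test_balanced$is_bullying,
    as.numeric(as.character(glm_tuned_pred_y)))
```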

Table of metrics

For each model, we calculate our chosen evaluation metrics (training time, precision, recall, F1 score, AUC) and combine them into a table of metrics that can be used to evaluate each model. We record the time taken to train each model to gauge how practical each would be, in terms of computation, if it were retrained continuously with new data: while training on the sample data is quick, it would take much longer with more data.

create_table <- function(prediction_list, time_to_train_c) {
  Model_Name <- names(prediction_list)
  Precision <- c()
  Recall <- c()
  F1_Score <- c()
  AUC <- c()
  Time_to_Train_secs <- time_to_train_c

  for (prediction_y in prediction_list) {
    # caret's precision(), recall() and F_meas() take (predicted, actual)
    Precision <- append(Precision, precision(prediction_y, test_balanced$is_bullying))
    Recall <- append(Recall, recall(prediction_y, test_balanced$is_bullying))
    F1_Score <- append(F1_Score, F_meas(prediction_y, test_balanced$is_bullying))
    # pROC::auc() needs a numeric predictor, hence the factor-to-integer cast
    AUC <- append(AUC, auc(test_balanced$is_bullying, as.integer(prediction_y)))
  }

  results <- data.frame(Model_Name, Precision, Recall, F1_Score, AUC, Time_to_Train_secs)
  return(results)
}
prediction_list <- list(svm_linear = svm_tuned_pred_y, svm_poly = svm_poly_pred_y, svm_rbf = svm_rbf_pred_y, mlp = mlp_tuned_pred_y, glmnet = glm_tuned_pred_y)
time_to_train <- c(svm_time_taken, svm_poly_tuned_time_taken, svm_rbf_tuned_time_taken, mlp_time_taken, glmnet_time_taken)
results <- create_table(prediction_list, time_to_train)
results

From these results, the glmnet logistic regression model appears to perform best, with the highest F1 score (81%) and the highest AUC. It is also relatively quick to train, which would reduce costs in a large-scale deployment where the model is continuously retrained on new data. The SVM model with a polynomial kernel achieves slightly better recall than the glmnet model, but the difference is too small to be meaningful.
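The same conclusion can be drawn programmatically from the metrics table. A sketch, assuming the `results` data frame produced above:

```r
# Sketch: the row with the highest F1 score
results[which.max(results$F1_Score), c("Model_Name", "F1_Score", "AUC")]
```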

We can also see that the MLP model performed the worst: it took a long time to train on a relatively small sample of data (almost 2 minutes) and it has the worst F1 and AUC scores, at 66% and 69% respectively.

Output evaluation results

write.csv(results, "Data/results.csv")

We output the results table to a CSV file to make it easy to share results with other group members.
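Note that `write.csv` includes a row-name index column by default; if that column is unwanted in the shared file, it can be suppressed with a minor variant of the call above:

```r
# Variant: omit the row-name index column from the exported CSV
write.csv(results, "Data/results.csv", row.names = FALSE)
```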

Individual contributions

Thomas Newton I was responsible for predictive modelling and hyperparameter tuning. I also created the problem statements and references.

Leon Harper INSERT HERE

Michal Jedruszczak In this project I was responsible for the preliminary EDA and the final EDA. I also created the contents section and made sure that all of the references were in MMU Harvard format.

References

Anti-Defamation League (2022) Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021 [Online][Accessed: November 19, 2022] https://www.adl.org/resources/report/hate-no-game-harassment-and-positive-social-experiences-online-games-2021

Bretschneider, U. and Peters, R. (2016) 'Detecting Cyberbullying in Online Communities.' Research Papers, Paper 61. http://aisel.aisnet.org/ecis2016_rp/61